Skip to content

Conversation

@suhaibmujahid
Copy link
Member

These improvements could be reviewed commit by commit.

We can have better tracking with W&B Weave
Introduces a run_by_diff_id method that retrieves a patch by diff ID from review_data and runs the review process.
Eliminated the abstract version property from GenerativeModelTool and removed the version attribute from CodeReviewTool since it is not used.
Moved suggestion filtering logic from code_review/agent.py to a new suggestion_filtering module. Introduced SuggestionFilteringTool for filtering review comments, updated CodeReviewTool to use the new filterer, and relocated related prompt templates. This improves modularity and separation of concerns for suggestion filtering.
Wrapped comments and rejected examples in <comments-to-filter> and <rejected-examples> tags to improve prompt structure and clarity.
It will be replaced with W&B Weave evaluation pipeline
@suhaibmujahid suhaibmujahid marked this pull request as ready for review January 10, 2026 22:34
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the code review evaluation infrastructure by replacing old script-based evaluation tools with a more modular architecture and W&B Weave integration for tracking evaluations.

Changes:

  • Removes legacy evaluation scripts (code_review_tool_evaluator.py, code_review_tool_evaluator_report.py) and experimental files
  • Introduces new modular tools for patch summarization, suggestion filtering, and comment matching
  • Adds Jupyter notebooks for dataset creation and evaluation using W&B Weave
  • Refactors CodeReviewTool to use Protocol-based dependency injection for better testability
  • Updates platform base classes to accept both str and int for patch_id parameters

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
scripts/code_review_tool_evaluator_report.py Removed legacy evaluation report generator
scripts/code_review_tool_evaluator.py Removed legacy evaluation script (613 lines)
experiments/review_helper_modify_filtering_step.ipy Removed experimental filtering modification script
requirements.txt Added weave>=0.50.0 for evaluation tracking
notebooks/code_review_evaluation.ipynb New notebook for running W&B Weave evaluations
notebooks/code_review_create_dataset.ipynb New notebook for creating evaluation datasets
bugbug/tools/suggestion_filtering/prompts.py Extracted filtering prompts to dedicated module
bugbug/tools/suggestion_filtering/agent.py New modular suggestion filtering tool
bugbug/tools/patch_summarization/prompts.py Extracted summarization prompts to dedicated module
bugbug/tools/patch_summarization/agent.py New modular patch summarization tool
bugbug/tools/comment_matching/prompts.py New prompts for LLM-based comment matching
bugbug/tools/comment_matching/agent.py New tool for matching generated vs ground truth comments
bugbug/tools/code_review/scorer.py New Weave scorers for evaluation metrics
bugbug/tools/code_review/utils.py Refactored to work with structured comment objects
bugbug/tools/code_review/prompts.py Removed prompts moved to specialized modules
bugbug/tools/code_review/agent.py Refactored to use Protocol-based dependencies
bugbug/tools/base.py Simplified by removing version property and print method
bugbug/tools/core/platforms/base.py Updated signature to accept str or int for patch_id
bugbug/tools/core/platforms/phabricator.py Updated signature to accept str or int for patch_id
bugbug/tools/core/platforms/swarm.py Updated signature to accept str or int for patch_id
bugbug/code_search/searchfox_api.py Made get_file parameter optional with default implementation
bugbug/code_search/mozilla.py Made get_file parameter optional with fallback

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

system_prompt=SYSTEM_PROMPT,
response_format=ProviderStrategy(AgentResponse),
)

Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The create method is missing the @staticmethod decorator. This method should be decorated with @staticmethod to match the pattern used in other similar factory methods in the codebase (e.g., PatchSummarizationTool.create(), SuggestionFilteringTool.create()).

Suggested change
@staticmethod

Copilot uses AI. Check for mistakes.
from pydantic import BaseModel, Field

from bugbug.tools.base import GenerativeModelTool
from bugbug.tools.code_review.agent import GeneratedReviewComment
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential circular import issue: this module imports GeneratedReviewComment from bugbug.tools.code_review.agent, while bugbug.tools.code_review.agent imports SuggestionFilteringTool from this module. This circular dependency could cause import errors. Consider moving GeneratedReviewComment to a shared data types module or using TYPE_CHECKING with forward references.

Copilot uses AI. Check for mistakes.

return result["structured_response"].comments

def run_by_diff_id(self, diff_id: str) -> list[InlineComment] | None:
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter type for diff_id should be str | int to match the signature of get_patch_by_id which was updated in the base class to accept both str and int. This inconsistency could cause type checking issues.

Suggested change
def run_by_diff_id(self, diff_id: str) -> list[InlineComment] | None:
def run_by_diff_id(self, diff_id: str | int) -> list[InlineComment] | None:

Copilot uses AI. Check for mistakes.
from bugbug.tools.patch_summarization.prompts import PROMPT_TEMPLATE_SUMMARIZATION


class PatchSummarizationTool(GenerativeModelTool):
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PatchSummarizationTool class is missing proper license header comments. While the file has the MPL license header, the class itself should maintain consistency with other similar tool classes that have docstrings explaining their purpose.

Suggested change
class PatchSummarizationTool(GenerativeModelTool):
class PatchSummarizationTool(GenerativeModelTool):
"""Tool for generating natural-language summaries of code patches.
This tool uses a chat-based large language model to summarize a given
patch set in the context of the target software, producing a concise
explanation suitable for reviewers or other stakeholders.
"""

Copilot uses AI. Check for mistakes.

def run(
self, old_comments: list[MatchingComment], new_comments: list[MatchingComment]
) -> list[CommentMatch]:
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run method lacks a docstring explaining its parameters and return value. This is especially important for public API methods. Similar methods in other tool classes (e.g., SuggestionFilteringTool.run) have comprehensive docstrings.

Suggested change
) -> list[CommentMatch]:
) -> list[CommentMatch]:
"""
Find matching code review comments between an old and a new set.
This method sends the provided comments to the underlying language model
agent, which identifies pairs of comments that correspond to each other
across the two versions.
Parameters
----------
old_comments:
A list of :class:`MatchingComment` instances representing comments
from the old set (e.g., an earlier revision of a code review).
new_comments:
A list of :class:`MatchingComment` instances representing comments
from the new set (e.g., a later revision of the same code review).
Returns
-------
list[CommentMatch]
A list of :class:`CommentMatch` objects, each describing a matched
pair of comments between the old and new sets. If either
``old_comments`` or ``new_comments`` is empty, an empty list is
returned.
"""

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,25 @@
PROMPT_TEMPLATE_SUMMARIZATION = """You are an expert reviewer for {target_software}, with experience on source code reviews.
Please, analyze the code provided and report a summarization about the new changes; for that, focus on the coded added represented by lines that start with "+".
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo in the prompt text. "the coded added" should be "the code added".

Suggested change
Please, analyze the code provided and report a summarization about the new changes; for that, focus on the coded added represented by lines that start with "+".
Please, analyze the code provided and report a summarization about the new changes; for that, focus on the code added represented by lines that start with "+".

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,62 @@
from langchain.agents import create_agent
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,48 @@
from langchain.agents import create_agent
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,156 @@
# -*- coding: utf-8 -*-
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +41
def create(**kwargs):
if "llm" not in kwargs:
from bugbug.tools.core.llms import create_openai_llm

kwargs["llm"] = create_openai_llm()

return CommentMatchingTool(**kwargs)
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normal methods should have at least one parameter (the first of which should be 'self').

Suggested change
def create(**kwargs):
if "llm" not in kwargs:
from bugbug.tools.core.llms import create_openai_llm
kwargs["llm"] = create_openai_llm()
return CommentMatchingTool(**kwargs)
@classmethod
def create(cls, **kwargs):
if "llm" not in kwargs:
from bugbug.tools.core.llms import create_openai_llm
kwargs["llm"] = create_openai_llm()
return cls(**kwargs)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant